Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(pgwire): unify query logs and slow query logs #13924

Merged
merged 5 commits into from
Dec 18, 2023

Conversation

BugenZhao
Copy link
Member

@BugenZhao BugenZhao commented Dec 11, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Provide the common info like mode, session, sql with a tracing span in the entrance function do_process, which...

The slow query log was introduced in #6624 a long time ago. However, it has recently been discovered (in #13287 (comment) and also by @liurenjie1024) that it fails to take into account the time required to consume the data stream. This PR also fixes that by handling all these stuff in pgwire.

Tested locally:

  • query.log
2023-12-12T16:43:36.017757+08:00 risingwave-main ThreadId(10) handle_query{mode="simple query" session_id=0 sql=select fu;}: status="err" time=68.853375ms
2023-12-12T16:45:20.525747+08:00 risingwave-main ThreadId(06) handle_query{mode="simple query" session_id=0 sql=select pg_sleep(65);}: status="ok" time=65.020061167s
  • slow_query.log
2023-12-12T16:45:15.508607+08:00 risingwave-main ThreadId(07) handle_query{mode="simple query" session_id=0 sql=select pg_sleep(65);}: slow query elapsed=60s

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

src/common/src/util/query_log.rs Show resolved Hide resolved
Signed-off-by: Bugen Zhao <[email protected]>
Signed-off-by: Bugen Zhao <[email protected]>
@BugenZhao BugenZhao requested review from chenzl25 and xxchan December 12, 2023 09:11
Copy link

codecov bot commented Dec 12, 2023

Codecov Report

Attention: 68 lines in your changes are missing coverage. Please review.

Comparison is base (942a826) 68.00% compared to head (0f92b6b) 68.07%.
Report is 1 commits behind head on main.

Files Patch % Lines
src/utils/runtime/src/logger.rs 0.00% 42 Missing ⚠️
src/utils/pgwire/src/pg_protocol.rs 81.81% 18 Missing ⚠️
src/frontend/src/session.rs 27.27% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13924      +/-   ##
==========================================
+ Coverage   68.00%   68.07%   +0.06%     
==========================================
  Files        1542     1543       +1     
  Lines      266905   266920      +15     
==========================================
+ Hits       181520   181714     +194     
+ Misses      85385    85206     -179     
Flag Coverage Δ
rust 68.07% <57.23%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@chenzl25 chenzl25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Good Catch

Comment on lines +251 to +264
/// Newtype wrapper for `DefaultFields`.
///
/// `fmt::Layer` will share the same `FormattedFields` extension for spans across
/// different layers, as long as the type of `N: FormatFields` is the same. This
/// will cause several problems:
///
/// - `with_ansi(false)` does not take effect and it will follow the settings of
/// the primary fmt layer installed above.
/// - `Span::record` will update the same `FormattedFields` multiple times,
/// leading to duplicated fields.
///
/// As a workaround, we use a newtype wrapper here to get a different type id.
/// The const generic parameter `SLOW` is further used to distinguish between the
/// query log and the slow query log.
Copy link
Member

@xxchan xxchan Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically, what part of the following log is controlled by this? (What does it look like without this?)

2023-12-12T16:43:36.017757+08:00 risingwave-main ThreadId(10) handle_query{mode="simple query" session_id=0 sql=select fu;}: status="err" time=68.853375ms

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- handle_query
+ <some control character for ANSI color>handle_query
- {mode="simple query" session_id=0 sql=select fu;}:
+ {mode="simple query" session_id=0 sql=select fu; sql=select fu; sql=select fu;}:
 status="err" time=68.853375ms

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +216 to +221
/// The span is only effective if there's a current session and the message is
/// query-related. Otherwise, `Span::none()` is returned.
fn root_span_for_msg(&self, msg: &FeMessage) -> tracing::Span {
let Some(session_id) = self.session.as_ref().map(|s| s.id().0) else {
return tracing::Span::none();
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is there no session? 👀

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess when the Startup message is not handled?

@@ -460,26 +554,12 @@ where
async fn process_query_msg(&mut self, query_string: io::Result<&str>) -> PsqlResult<()> {
let sql: Arc<str> =
Arc::from(query_string.map_err(|err| PsqlError::SimpleQueryError(Box::new(err)))?);
let start = Instant::now();
record_sql_in_span(&sql);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a little confused. Does this duplicate with the layer in do_process?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the layer in do_process?

What do you mean by this? 😳

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, do_process emits the log line, but process_query_msg adds the sql field. I see.

/// The const generic parameter `SLOW` is further used to distinguish between the
/// query log and the slow query log.
#[derive(Default)]
struct FmtFields<const SLOW: bool>(DefaultFields);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the SLOW used?

Copy link
Member Author

@BugenZhao BugenZhao Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let layer = match is_slow {
true => layer.fmt_fields(FmtFields::<true>::default()).boxed(),
false => layer.fmt_fields(FmtFields::<false>::default()).boxed(),
};

There's no actual use for this flag. It just makes the compiler to treat them as different types and then assign them with different type_ids, which is used for RTTI.

tracing::info!(
target: PGWIRE_QUERY_LOG,
status = if result.is_ok() { "ok" } else { "err" },
time = ?elapsed,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, shall we print the elapsed in a fixed width and fixed unit format?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate more? I'm afraid it's hard to do this for the duration from 50 µs = 0.05 ms to 5 min = 300 000 ms. 🥺

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ms is enough for batch query. How about using .as_millis()? With the same time unit could let us easy to grep these lines.

Copy link
Member

@fuyufjh fuyufjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I feel that "unifying query logs and slow query logs" doesn't seem to reduce the complexity but somehow make it more complicated to me... Tend to keep the original implementation, which was simple and straight-forward.

@@ -160,6 +158,17 @@ pub fn cstr_to_str(b: &Bytes) -> Result<&str, Utf8Error> {
std::str::from_utf8(without_null)
}

/// Record `sql` in the current tracing span.
fn record_sql_in_span(sql: &str) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the relationship between this and the run_stmt in #10660?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you're talking about the span created here:

tracing::info_span!(
target: PGWIRE_ROOT_SPAN_TARGET,
"handle_query",
mode,
session_id,
sql = tracing::field::Empty, // record SQL later in each `process` call
)

This replaces the old one, which was renamed to process_query_msg_one_stmt after that PR. Pros:

  • It's defined in the unified path for handling Query Parse and Execute, while the old one only handled (simple) Query.
  • It's now not only used by traces, but also used to provide common info for the query logs later, including the slow ones.
    tracing::info!(
    target: PGWIRE_QUERY_LOG,
    status = if result.is_ok() { "ok" } else { "err" },
    time = ?elapsed,
    );

    tracing::info!(
    target: PGWIRE_SLOW_QUERY_LOG,
    ?elapsed,
    "slow query"
    );

As for recording the SQL field after creating the span: I just don't want to duplicate the code for extracting the SQL from the messages. This also reduces allocations.

Comment on lines +298 to +299
// Query log.
let fut = async move {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the consecutive async move layers have performance implications? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess not intuitively. 🤣 This is just like the adaptors for iterators or futures.

@BugenZhao
Copy link
Member Author

BugenZhao commented Dec 13, 2023

Overall, I feel that "unifying query logs and slow query logs" doesn't seem to reduce the complexity but somehow make it more complicated to me... Tend to keep the original implementation, which was simple and straight-forward.

To be honest, this PR is more like a fix and a feature than a refactor.

  • Frontend tracing now works in extended mode as well.
  • The slow query log now correctly records the time by moving it to pgwire.

If the refactor were not done, then the code for providing the common fields like sql or session_id has to be repeated 3 (log, slow log, trace span) x 3(query, parse, execute) = 9 times, also including the code for truncating and possibly redacting (#12673). I believe DRY is necessary here. 😕

New code introduced in this PR is scalable. For example, it requires minimal changes to

  • add or transform a field
  • support a new query type

@BugenZhao BugenZhao added this pull request to the merge queue Dec 18, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 18, 2023
@BugenZhao BugenZhao enabled auto-merge December 18, 2023 07:19
@BugenZhao BugenZhao added this pull request to the merge queue Dec 18, 2023
Merged via the queue into main with commit d1128cb Dec 18, 2023
26 of 27 checks passed
@BugenZhao BugenZhao deleted the bz/unified-query-log branch December 18, 2023 08:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants